Abstract In our paper [18] we introduced a special kind of k-width junction tree,
called the k-th order t-cherry junction tree, in order to approximate a joint probability
distribution. The approximation is best when the Kullback-Leibler divergence between
the true joint probability distribution and the approximating one is minimal. Finding
the best approximating k-width junction tree is NP-complete if k > 2 (see [12]). In
[19] we also proved that the best approximating k-width junction tree can be embedded
into a k-th order t-cherry junction tree. We introduce a greedy algorithm that yields very
good approximations in reasonable computing time.

In this paper we prove that if the underlying Markov network fulfills some
requirements, then our greedy algorithm is able to find the true probability distribution
or its best approximation in the family of the k-th order t-cherry tree probability
distributions. Our algorithm uses just the k-th order marginal probability distributions
as input.

We compare the results of the greedy algorithm proposed in this paper with the
greedy algorithm proposed by Malvestuto [16].
1 Introduction

The problem of approximating multivariate probability distributions is a central task
in many fields. Unfortunately, in most cases we know nothing about the theoretical
probability distribution. It is useful to exploit the dependence structure between
the random variables involved. The problem is what to do when correlation
matrices cannot be used.

Starting from a discrete probability distribution, for example from sample data,
it is useful to discover some of the conditional independences between the variables.

The Markov networks (Markov random fields) and Bayesian networks encode these 
conditional independences. In our paper we focus on the Markov networks. If the 
graph structure of the Markov network is known, many procedures were developed for 
its inference, see [T^ and [8]. There are many cases where the graph structure of the 
Markov network is unknown. In [18] we proposed a method for discovering some of 
the conditional independences between the random variables by fitting a special type 
of multivariate probability distribution called t-cherry junction tree distribution to the 
sample data. The goodness of fit was quantified by the Kullback-Leibler divergence 
(see [14]). This relates the problem to information theory ([7]). On the other side, the 
graph underlying the Markov network links the problem to graph theory. For elements 
of graph theory see ^ . 

In the second section we introduce some concepts from graph theory and probability
theory that we need throughout the paper and present how these can be linked
to each other. For a good overview see [15].

In the third part we introduce Szantai-Kovacs's greedy algorithm which, starting
from the k-th order marginal probability distributions, gives a k-th order t-cherry
junction tree probability distribution as a result. For the same task Malvestuto gives
another algorithm in [16]. First we compare these two algorithms from an analytical
point of view and then apply them to the example problem presented in Malvestuto's
paper [16].

In the fourth part we introduce the so-called puzzle algorithm for k-th order
t-cherry trees. This results in a puzzle numbering of the vertices. Using this we give
some theoretical results related to our greedy algorithm.

The last part contains conclusions and some possible applications of our greedy 
algorithm. 

2 Preliminaries 

This part contains a summary of the concepts used throughout the paper. We first
present acyclic hypergraphs and junction trees. We then give a short reminder on
Markov networks. We finish this part with the multivariate joint probability distribution
associated with a junction tree.

Let $V = \{1, \ldots, d\}$ be a set of vertices and F a set of subsets of V, called the set of
hyperedges. A hypergraph consists of a set V of vertices and a set F of hyperedges.
We denote a hyperedge by $C_i$, where $C_i$ is a subset of V. If two vertices are in the
same hyperedge they are connected, which means that a hyperedge of a hypergraph is a
complete graph on the set of vertices contained in it.

A vertex is called simplicial if it belongs to precisely one hyperedge. 

An ordering of the vertices is a perfect elimination ordering if for all $i$, $1 \le i \le d$, the
vertex $i$ is simplicial in the subhypergraph defined on the vertices $\{i, i+1, \ldots, d\}$.

The acyclic hypergraph is a special type of hypergraph which fulfills the following 
requirements: 

- None of the edges of F is a subset of another edge.
- There exists a numbering of the edges for which the running intersection property is
fulfilled: $\forall j \ge 2\ \exists i < j : C_i \supseteq C_j \cap (C_1 \cup \ldots \cup C_{j-1})$. (Another formulation is
that for all hyperedges $C_i$ and $C_j$ with $i < j - 1$, $C_i \cap C_j \subseteq C_s$ for all $s$, $i < s < j$.)

Let $S_j = C_j \cap (C_1 \cup \ldots \cup C_{j-1})$ for $j > 1$ and $S_1 = \emptyset$. Let $R_j = C_j \setminus S_j$. We say
that $S_j$ separates $R_j$ from $(C_1 \cup \ldots \cup C_{j-1}) \setminus S_j$, and call $S_j$ a separator set or, shortly,
a separator.

Now we link these concepts to the terminology of junction trees. 

The junction tree is a special tree structure which is equivalent to the connected
acyclic hypergraphs [15]. The nodes of the tree correspond to the hyperedges of the
connected acyclic hypergraph and are called clusters; the edges of the tree correspond
to the separator sets and are called separators. The set of all clusters is denoted by $\mathcal{C}$,
the set of all separators by $\mathcal{S}$. A junction tree whose largest cluster
contains k variables is called a k-width junction tree.

An important relation between graphs and hypergraphs is given in [15]: a hypergraph
is acyclic if and only if it can be considered to be the set of cliques of a
triangulated graph (a graph is triangulated if every cycle of length at least 4 has a
chord).

Theorem 1 (Fulkerson and Gross, l^): A graph is an acyclic hypergraph (triangulated 
graph or junction tree) if and only if it has a perfect elimination ordering.

Algorithm 1 (Graham, [10]) A Graham reduction of a hypergraph H = (V, F) is
defined by applying the following two operations to H until they can be applied no
more.

- Node removal: If a node appears in only one hyperedge, delete it from V and from
the edge.

- Hyperedge removal: In the transformed hyperedge set, delete a hyperedge if it
is a subset of another hyperedge.

In [T] it is shown that a hypergraph reduces to nothing by this process if and only if
the hypergraph is acyclic.
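
The two reduction rules can be applied mechanically. The following is a minimal Python sketch of that idea, not code from the cited references: the function name `is_acyclic` and the representation of hyperedges as Python sets are assumptions made here, and an edge that becomes empty is simply dropped.

```python
def is_acyclic(hyperedges):
    """Graham reduction: return True if the hypergraph reduces to nothing."""
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed:
        changed = False
        # Node removal: delete every node that appears in only one hyperedge.
        for e in edges:
            for v in list(e):
                if sum(v in f for f in edges) == 1:
                    e.discard(v)
                    changed = True
        # Hyperedge removal: drop empty edges, proper subsets and duplicates.
        kept = []
        for i, e in enumerate(edges):
            redundant = not e or any(
                e < f or (e == f and j < i)
                for j, f in enumerate(edges) if j != i
            )
            if redundant:
                changed = True
            else:
                kept.append(e)
        edges = kept
    return not edges


# The hyperedges {1,2,3}, {2,3,4}, {3,4,5} reduce to nothing; a triangle of
# three two-element edges does not.
chain_like = is_acyclic([{1, 2, 3}, {2, 3, 4}, {3, 4, 5}])
triangle = is_acyclic([{1, 2}, {2, 3}, {1, 3}])
```

With these inputs `chain_like` is true and `triangle` is false, in line with the statement that only acyclic hypergraphs reduce to nothing.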

Figure 1 shows a) a triangulated graph, b) the corresponding acyclic
hypergraph and c) the corresponding junction tree.

We consider the random vector $X = (X_1, \ldots, X_d)^T$, with the set of indices
$V = \{1, \ldots, d\}$. Roughly speaking, a Markov network encodes the conditional independences
between the random variables. The graph structure associated with a Markov network
consists of the set of nodes V and the set of edges $E = \{(i,j) \mid i,j \in V\}$. We say the
graph structure associated with the Markov network has

- the pairwise Markov (PM) property if for all $i, j \in V$, i not connected to j implies that
$X_i$ and $X_j$ are conditionally independent given all the other random variables;

- the local Markov (LM) property if for all $i \in V$, with $Ne(i)$ denoting the neighbourhood
of node i in the graph (the nodes connected with i), $X_i$ is conditionally independent
of all $X_j$, $j \notin Ne(i)$, given $X_k$, $k \in Ne(i)$;

- the global Markov (GM) property if for all $A, B, C \subset V$ such that C
separates A and B in the graph, $X_A$ and $X_B$ are conditionally independent
given $X_C$, which means in terms of probabilities that

$$P(X_{A \cup B \cup C}) = \frac{P(X_{A \cup C})\, P(X_{B \cup C})}{P(X_C)};$$

- the factorization (F) property if, with $\mathcal{C}$ denoting the set of cliques of the graph
(maximal complete subgraphs), there exist positive functions $\psi_C(X_C)$ such that

$$P(X) = \prod_{C \in \mathcal{C}} \psi_C(X_C).$$

Fig. 1 a) Triangulated graph, b) The corresponding acyclic hypergraph, c) The corresponding
junction tree

The following implication is well known: F $\Rightarrow$ GM $\Rightarrow$ LM $\Rightarrow$ PM. The
Hammersley-Clifford theorem states that under the assumption of positivity PM $\Rightarrow$ F.
However, positivity is a very strong condition: "The positivity condition is
mathematically convenient; but it hardly seems necessary" [11]. In this paper we focus on
Markov networks characterized by the global Markov property.

The concept of junction tree probability distribution is related to the junction tree
graph and to the global Markov property of the graph. A junction tree probability
distribution is defined as a product and division of marginal probability distributions
as follows:

$$P(X) = \frac{\prod_{C \in \mathcal{C}} P(X_C)}{\prod_{S \in \mathcal{S}} [P(X_S)]^{\nu_S - 1}},$$

where $\mathcal{C}$ is the set of clusters of the junction tree, $\mathcal{S}$ is the set of separators, and $\nu_S$ is
the number of those clusters which contain the separator S. We emphasize here that
equalities written as $P(X) = f(P(X_K), K \in \mathcal{C})$, where $f : \Omega_X \to R$, hold for any
possible realization of X.

Example 1 The probability distribution corresponding to Figure 1 is:

$$P(X) = \frac{P(X_{\{1,2,3\}})\, P(X_{\{2,3,4\}})\, P(X_{\{3,4,5\}})}{P(X_{\{2,3\}})\, P(X_{\{3,4\}})}
= \frac{P(X_1, X_2, X_3)\, P(X_2, X_3, X_4)\, P(X_3, X_4, X_5)}{P(X_2, X_3)\, P(X_3, X_4)}.$$
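
The product-division form of Example 1 can be evaluated numerically. The sketch below is only an illustration under stated assumptions: the joint distribution `P` is a made-up, strictly positive distribution of five binary variables, indices are 0-based, and the helper names (`marginal`, `junction_tree_pd`, `kl_divergence`) are introduced here and do not come from the paper.

```python
import itertools
import math
import random


def marginal(joint, idx):
    """Marginal distribution of the variables with (0-based) indices idx."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def junction_tree_pd(joint, clusters, separators):
    """Product of cluster marginals divided by separator marginals.

    `separators` maps each separator to nu_S, the number of clusters
    which contain it.
    """
    cl = {c: marginal(joint, c) for c in clusters}
    sp = {s: marginal(joint, s) for s in separators}
    approx = {}
    for x in joint:
        value = 1.0
        for c, m in cl.items():
            value *= m[tuple(x[i] for i in c)]
        for s, m in sp.items():
            value /= m[tuple(x[i] for i in s)] ** (separators[s] - 1)
        approx[x] = value
    return approx


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)


# Hypothetical strictly positive joint distribution of five binary variables.
random.seed(1)
states = list(itertools.product((0, 1), repeat=5))
raw = [random.random() + 0.1 for _ in states]
P = {x: r / sum(raw) for x, r in zip(states, raw)}

# Structure of Example 1, written with 0-based indices.
clusters = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
separators = {(1, 2): 2, (2, 3): 2}
P_J = junction_tree_pd(P, clusters, separators)
divergence = kl_divergence(P, P_J)
```

The values of `P_J` sum to one, and applying `junction_tree_pd` to `P_J` itself returns `P_J` again, since the cluster marginals of `P_J` agree with those of `P`.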
In our paper [18] we introduced a special kind of k-width junction tree, called the k-th
order t-cherry junction tree, in order to approximate a joint probability distribution.
The k-th order t-cherry junction tree probability distribution is associated with the k-th
order t-cherry tree, introduced in [3], [5].

Definition 1 The recursive construction of the k-th order t-cherry tree:

- (i) The complete graph of $(k-1)$ nodes from V represents the smallest k-th order
t-cherry tree;

- (ii) By connecting a new vertex $i_k \in V$ with all vertices $\{i_1, \ldots, i_{k-1}\}$ of a $(k-1)$-
dimensional complete subgraph of the existing k-th order t-cherry tree, we obtain a
new k-th order t-cherry tree; $\{\{i_k\}, \{i_1, \ldots, i_{k-1}\}\}$ is called a k-th order hypercherry.

- (iii) A k-th order t-cherry tree can be obtained from (i) by successive application
of (ii).

The k-th order t-cherry tree is a special triangulated graph, therefore a junction
tree structure is associated with it.
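
The recursive construction can be sketched in a few lines of Python. This is an illustration only: the function name `build_t_cherry_tree`, its arguments, and the representation of a hypercherry as a pair `(new_vertex, separator)` are assumptions made here. The function starts from the complete graph on k - 1 vertices, as in (i), and rejects any step that does not satisfy (ii).

```python
from itertools import combinations


def build_t_cherry_tree(k, start, hypercherries):
    """Build a k-th order t-cherry tree by successive hypercherry additions.

    start         -- the k - 1 vertices of the smallest tree, step (i)
    hypercherries -- iterable of (new_vertex, separator) pairs, step (ii)
    Returns the edge set of the graph and the list of clusters.
    """
    start = tuple(start)
    if len(set(start)) != k - 1:
        raise ValueError("the smallest tree needs k - 1 distinct vertices")
    vertices = set(start)
    edges = {frozenset(pair) for pair in combinations(start, 2)}
    clusters = []
    for new, separator in hypercherries:
        separator = frozenset(separator)
        # The separator must be a complete (k - 1)-vertex subgraph of the tree.
        is_complete = all(
            frozenset(pair) in edges for pair in combinations(separator, 2)
        )
        if (new in vertices or len(separator) != k - 1
                or not separator <= vertices or not is_complete):
            raise ValueError(f"hypercherry ({new}, {sorted(separator)}) is not admissible")
        edges |= {frozenset((new, s)) for s in separator}
        vertices.add(new)
        clusters.append(separator | {new})
    return edges, clusters


# Third order t-cherry tree with the clusters of Example 1.
edges, clusters = build_t_cherry_tree(
    3, (2, 3), [(1, (2, 3)), (4, (2, 3)), (5, (3, 4))]
)
```

Each accepted hypercherry contributes one cluster, so the call above returns the clusters {1,2,3}, {2,3,4} and {3,4,5}.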

Definition 2 ([18]) The k-th order t-cherry junction tree is defined in the following
way:

- By using Definition 1 we construct a k-th order t-cherry tree over V.

- To each hypercherry $\{\{i_k\}, \{i_1, \ldots, i_{k-1}\}\}$ is assigned a cluster $\{i_1, \ldots, i_k\}$,
which is a node of the junction tree, and a separator $\{i_1, \ldots, i_{k-1}\}$, which is an edge
of the junction tree.

We denote by $\mathcal{C}_{ch}$ and $\mathcal{S}_{ch}$ the set of clusters and the set of separators of the t-cherry
junction tree.

Definition 3 ([18]) If the indices of the random vector $X^T = (X_1, \ldots, X_d)$ are
assigned to a t-cherry junction tree structure, then there exists a probability distribution,
called t-cherry junction tree probability distribution, given by:

$$P_{t-ch}(X) = \frac{\prod_{C \in \mathcal{C}_{ch}} P(X_C)}{\prod_{S \in \mathcal{S}_{ch}} [P(X_S)]^{\nu_S - 1}}.$$

Remark 1 The marginal probability distributions involved in the above formula are
marginal probability distributions of P(X).

Example 1 shows a 3-rd order t-cherry junction tree probability distribution.

In the following, instead of probability distribution associated with a junction tree
we write shortly junction tree pd, and similarly instead of k-th order t-cherry tree
junction tree distribution we write shortly k-th order t-cherry pd. Recently we found
a paper [16] where Malvestuto introduced the same junction tree pd structure in a
different way and named it elementary model of rank k.

The graph underlying the Markov network is usually unknown. The task of the
following section is to give a greedy algorithm for finding a junction tree starting from
the k-th order marginal distributions, which are supposed to be known.
3 Szantai-Kovacs's greedy algorithm for finding an approximating
junction tree probability distribution

The problem is to find a k-width junction tree pd which gives the best approximation
of a discrete probability distribution P(X). The goodness of the approximation is
quantified by the Kullback-Leibler divergence, which has to be minimized:

$$KL(P(X), P_a(X)) = \sum_{X} P(X) \log_2 \frac{P(X)}{P_a(X)} \to \min.$$

For k > 2 this minimization problem can be solved exactly only by exhaustive
search [T^-. For k = 2 the problem can be solved using Kruskal's algorithm, as was first
proposed by Chow and Liu [6].

Malvestuto [16] and Szantai et al. [19] proved independently and in different ways
the following statement: if $P_a(X)$ is a k-width junction tree pd approximation, then
there exists a k-th order t-cherry tree pd $P_{t-ch}(X)$ which gives at least as good an
approximation as $P_a(X)$ does, i.e.:

$$KL(P(X), P_a(X)) \ge KL(P(X), P_{t-ch}(X)).$$

Because of this result we take the k-th order t-cherry junction tree pd's as the search space.

In this part we first give a greedy algorithm to minimize the Kullback-Leibler
divergence between the true probability distribution and a t-cherry junction tree pd,
given the k-th order marginal probability distributions. We then compare our algorithm
with Malvestuto's algorithm from an analytical point of view. Then we apply the two
algorithms to the same sample data proposed in [16].

In [18] the authors give the following theorem.

Theorem 2 The Kullback-Leibler divergence between the true P(X) and the
approximation given by the k-width junction tree probability distribution $P_J(X)$, determined
by the set of clusters $\mathcal{C}$ and the set of separators $\mathcal{S}$, is:

$$KL(P(X), P_J(X)) = -H(X) - \Big( \sum_{C \in \mathcal{C}} I(X_C) - \sum_{S \in \mathcal{S}} (\nu_S - 1)\, I(X_S) \Big) + \sum_{i=1}^{d} H(X_i), \qquad (1)$$

where $I(X_C) = \sum_{i \in C} H(X_i) - H(X_C)$ represents the information content of the random
vector $X_C$ and similarly $I(X_S) = \sum_{i \in S} H(X_i) - H(X_S)$ represents the information
content of the random vector $X_S$.

In Formula (1), $-H(X) + \sum_{i=1}^{d} H(X_i) = I(X)$ is independent of the structure
of the junction tree. It is easy to see that minimizing the Kullback-Leibler divergence
means maximizing $\sum_{C \in \mathcal{C}} I(X_C) - \sum_{S \in \mathcal{S}} (\nu_S - 1)\, I(X_S)$. We call this sum the weight of the
junction tree pd. The larger this weight is, the better the approximation associated with
the junction tree pd fits the true probability distribution. It is well known that KL = 0
if $P(X) = P_J(X)$.

In the case when the approximating probability distribution is given by a k-th order
t-cherry junction tree pd, all of the clusters contain k vertices and all of the separators
contain k - 1 vertices in Formula (1).
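
The step from entropies to information contents in Formula (1) can be written out as follows. This is a sketch under two assumptions: the cluster and separator marginals of $P_J$ coincide with those of $P$ (Remark 1), and every index satisfies the counting identity stated in the last line, as is the case for k-th order t-cherry junction trees.

```latex
\begin{align*}
KL(P, P_J)
  &= -H(X) + \sum_{C \in \mathcal{C}} H(X_C)
     - \sum_{S \in \mathcal{S}} (\nu_S - 1)\, H(X_S) \\
  &= -H(X) + \sum_{C \in \mathcal{C}} \Big( \sum_{i \in C} H(X_i) - I(X_C) \Big)
     - \sum_{S \in \mathcal{S}} (\nu_S - 1) \Big( \sum_{i \in S} H(X_i) - I(X_S) \Big) \\
  &= -H(X) + \sum_{i=1}^{d} H(X_i)
     - \Big( \sum_{C \in \mathcal{C}} I(X_C)
     - \sum_{S \in \mathcal{S}} (\nu_S - 1)\, I(X_S) \Big), \\
\text{using}\quad
  & \#\{C \in \mathcal{C} : i \in C\}
    - \sum_{S \in \mathcal{S},\, i \in S} (\nu_S - 1) = 1
    \quad \text{for every } i \in V .
\end{align*}
```

The first line is the entropy form of the divergence; the second substitutes $H(X_C) = \sum_{i \in C} H(X_i) - I(X_C)$ and the analogous identity for separators.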

Let $X = \{X_1, \ldots, X_d\}$ be a set of random variables.

Definition 4 We define the following concepts:

- the search space:

$$E = \{ x_{i_k}(i_1, \ldots, i_{k-1}) = \{\{X_{i_k}\}, \{X_{i_1}, \ldots, X_{i_{k-1}}\}\} \mid X_{i_1}, \ldots, X_{i_{k-1}}, X_{i_k} \in X \},$$

- the independence set:

$$\mathcal{F} = \emptyset \cup \{\text{t-cherry junction tree structure}\},$$

- the weight function:

$$w : E \to R, \quad w(x_{i_k}(i_1, \ldots, i_{k-1})) = I(X_{i_1}, \ldots, X_{i_{k-1}}, X_{i_k}) - I(X_{i_1}, \ldots, X_{i_{k-1}}).$$

Algorithm 2 Szantai-Kovacs's greedy algorithm.

Input: Elements of E and their weights, which can be calculated based on the k-th
order marginal probability distributions.

Output: the set A which contains the clusters of the k-th order t-cherry junction tree pd
and the weight W of the k-th order t-cherry junction tree pd.

The algorithm:

$A := \emptyset$

Sort E into monotonically decreasing order by weight w;
Choose $x = \arg\max_{x \in E} w(x)$;

let $A := A \cup \{x\}$; $E := E \setminus \{x\}$; $W := I(x)$, the information content of the cluster of x;
Do for each $x \in E$ taken in monotonically decreasing order

if $A \cup \{x\} \in \mathcal{F}$ then let $A := A \cup \{x\}$; $E := E \setminus \{x\}$; $W := W + w(x)$;

if the union of the subsets of A is X, then Stop;

else take the next element of E.
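
To make the steps of Algorithm 2 concrete, here is a minimal Python sketch. It is one possible reading, not the authors' implementation: at every step it picks the admissible hypercherry of largest weight, the joint distribution is passed in full (only its marginals of order at most k are used), indices are 0-based, and all function names are assumptions made here. The test distribution is a made-up binary Markov chain, for which k = 2 should recover the chain itself.

```python
import itertools
import math


def entropy(joint, idx):
    """Entropy (bits) of the marginal on the 0-based indices idx."""
    marg = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)


def info_content(joint, idx):
    """I(X_idx) = sum of single-variable entropies minus joint entropy."""
    return sum(entropy(joint, (i,)) for i in idx) - entropy(joint, idx)


def greedy_t_cherry(joint, d, k):
    """Greedy search for a k-th order t-cherry junction tree."""
    # Weight of every hypercherry (new vertex, separator): I(X_C) - I(X_S).
    weights = {}
    for cluster in itertools.combinations(range(d), k):
        for v in cluster:
            sep = tuple(i for i in cluster if i != v)
            weights[(v, sep)] = info_content(joint, cluster) - info_content(joint, sep)
    # First cluster: hypercherry of largest weight, counted with its full I(X_C).
    v, sep = max(weights, key=weights.get)
    first = tuple(sorted(sep + (v,)))
    clusters, covered = [first], set(first)
    total = info_content(joint, first)
    separators = set(itertools.combinations(first, k - 1))
    while len(covered) < d:
        # Admissible: a new vertex attached to a separator of the current tree.
        admissible = [(w, v, s) for (v, s), w in weights.items()
                      if v not in covered and s in separators]
        w, v, s = max(admissible)
        cluster = tuple(sorted(s + (v,)))
        clusters.append(cluster)
        covered.add(v)
        total += w
        separators |= set(itertools.combinations(cluster, k - 1))
    return clusters, total


def chain_joint(flips):
    """Hypothetical binary Markov chain; flips[i] = P(X_{i+1} != X_i)."""
    joint = {}
    for x in itertools.product((0, 1), repeat=len(flips) + 1):
        p = 0.5
        for a, b, f in zip(x, x[1:], flips):
            p *= f if a != b else 1 - f
        joint[x] = p
    return joint


joint = chain_joint((0.1, 0.2, 0.3))
clusters, weight = greedy_t_cherry(joint, d=4, k=2)
```

For this chain the sketch returns the clusters (0, 1), (1, 2), (2, 3), and the weight equals I(X), which by Formula (1) corresponds to a Kullback-Leibler divergence of zero.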

In our t-cherry junction tree terminology the KL divergence formula used by
Malvestuto in his paper [16] is:

$$KL(P(X), P_{t-ch}(X)) = -H(X) + \sum_{C \in \mathcal{C}} H(X_C) - \sum_{S \in \mathcal{S}} (\nu_S - 1)\, H(X_S). \qquad (2)$$

In order to minimize the KL divergence Malvestuto had to minimize

$$\sum_{C \in \mathcal{C}} H(X_C) - \sum_{S \in \mathcal{S}} (\nu_S - 1)\, H(X_S)$$

in a greedy way.

Malvestuto's algorithm uses the same search space E and independence set $\mathcal{F}$. The
weight function however is different:

$$\omega : E \to R, \quad \omega(x_{i_k}(i_1, \ldots, i_{k-1})) = H(X_{i_1}, \ldots, X_{i_{k-1}}, X_{i_k}) - H(X_{i_1}, \ldots, X_{i_{k-1}}).$$

Algorithm 3 Malvestuto's greedy algorithm.

Input: Elements of E and their weights, which can be calculated based on the k-th
order marginal probability distributions.

Output: the set A which contains the clusters of the k-th order t-cherry junction tree
probability distribution and the weight $\Omega$ of the k-th order t-cherry junction tree.

$A := \emptyset$

Sort E into monotonically increasing order by weight $\omega$;
Choose $x = \arg\min_{x \in E} H(x)$;

let $A := A \cup \{x\}$; $E := E \setminus \{x\}$; $\Omega := H(x)$;
Do for each $x \in E$ taken in monotonically increasing order

if $A \cup \{x\} \in \mathcal{F}$ then let $A := A \cup \{x\}$; $E := E \setminus \{x\}$; $\Omega := \Omega + \omega(x)$;

if the union of the subsets of A is X, then Stop;

else take the next element of E.

We present experimental results on the application of the two algorithms to the
probability distribution obtained from the sample data published in [16].
These data contain information on the structural habitat of grahami and opalinus
lizards. They were published originally by Bishop et al. and we give them in Table 1.

Table 1 Counts in structural habitat categories for grahami and opalinus lizards

Cell $(X_1, X_2, X_3, X_4, X_5)$   Observed   Cell $(X_1, X_2, X_3, X_4, X_5)$   Observed

The data consist of observed counts for perch height (< 2' or > 2') $X_1$, perch
diameter (< 5" or > 5") $X_2$, insolation (sun, shade) $X_3$, time of day (early,
midday, late) $X_4$, and lizard type (grahami, opalinus) $X_5$. The size of the contingency table
is 2 x 2 x 2 x 3 x 2.

First we compare the goodness of fit of the 4-th order t-cherry junction tree found
by Szantai-Kovacs's algorithm with that of the tree found by Malvestuto's algorithm.

Table 2 shows the information contents of the marginal probability distributions
of 4 random variables and of 3 random variables, and the weights used in Szantai-Kovacs's
algorithm, in decreasing order.



Table 2 Illustration of Szantai-Kovacs's algorithm

The junction tree obtained by Szantai-Kovacs's algorithm has two clusters, {1, 3, 4, 5} and
{1, 2, 4, 5}, and one separator, {1, 4, 5}. The KL divergence in this case is:

$$KL = I(X) - (I(X_{\{1,3,4,5\}}) - I(X_{\{1,4,5\}}) + I(X_{\{1,2,4,5\}}))$$

$$= 0.19519 - (0.129381 - 0.047533 + 0.100251) = 0.013091.$$

Table 3 shows the entropies of the marginal probability distributions of 4
random variables and of 3 random variables, and the weights used in Malvestuto's algorithm,
in increasing order.

The junction tree obtained by Malvestuto's algorithm has two clusters, {1, 2, 3, 5} and
{1, 3, 4, 5}, and one separator, {1, 3, 5}. The KL divergence in this case is:

$$KL = -H(X) + H(X_{\{1,2,3,5\}}) - H(X_{\{1,3,5\}}) + H(X_{\{1,3,4,5\}})$$

$$= -4.64164 + 3.288813 - 2.36849 + 3.743757 = 0.02244.$$


Table 3 Illustration of Malvestuto's algorithm

Indices of the       Indices of the
cluster variables    separator variables    H(X_C)      H(X_S)      H(X_C) - H(X_S)

1 2 3 5                                     3.288813
1 3 4 5              1 3 5                  3.743757    2.368490    1.375267
2 3 4 5              2 3 5                  3.783647    2.406170    1.377478
1 2 3 4              1 2 3                  3.943287    2.563246    1.380041
1 2 4 5              1 2 5                  4.046977    2.615873    1.431104
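
As a consistency check on Table 3, the weight column can be recomputed from the two entropy columns, and the smallest weight identifies the cluster that the greedy step adds after {1, 2, 3, 5}. The Python snippet below uses only the tabulated values; the tuple layout is an assumption made here.

```python
# (cluster, separator, H(X_C), H(X_S), tabulated weight H(X_C) - H(X_S))
rows = [
    ((1, 3, 4, 5), (1, 3, 5), 3.743757, 2.368490, 1.375267),
    ((2, 3, 4, 5), (2, 3, 5), 3.783647, 2.406170, 1.377478),
    ((1, 2, 3, 4), (1, 2, 3), 3.943287, 2.563246, 1.380041),
    ((1, 2, 4, 5), (1, 2, 5), 4.046977, 2.615873, 1.431104),
]

# Largest deviation between the recomputed and the tabulated weight.
max_error = max(abs((h_c - h_s) - w) for _, _, h_c, h_s, w in rows)

# Malvestuto's algorithm takes candidates in increasing order of weight.
best_cluster, best_separator = min(rows, key=lambda r: r[4])[:2]
```

The recomputed weights agree with the table up to rounding in the last digit, and the smallest weight belongs to the cluster {1, 3, 4, 5} with separator {1, 3, 5}.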



The two KL divergence values show that the junction tree obtained by our
algorithm fits the probability distribution of the sample data better.

If the task is fitting a third order t-cherry junction tree, then our algorithm finds a
t-cherry junction tree probability distribution with KL = 0.0355415. The third order
t-cherry junction tree given by Malvestuto's algorithm has KL = 0.0375077. The
clusters found by our algorithm were {3,4,5}, {1,4,5}, {1,2,5} and those found by
Malvestuto's algorithm were {1,3,5}, {1,2,5}, {3,4,5}.



4 Theorems related to Szantai-Kovacs's algorithm

This part contains some theoretical discussion of the algorithm introduced above, under
assumptions on the Markov network underlying the variables.

As recalled in the preliminaries, a triangulated graph can be represented as
a junction tree structure. If the graph is complete then the junction tree has only one
cluster.

If a graph is not triangulated, then by adding edges it can be transformed into a
triangulated graph. The problem of "filling in as few edges as possible" is known to be
NP-complete ([21]). A greedy algorithm was given by Tarjan and Yannakakis [20].

If the vertices of a graph represent the indices of the random variables of a Markov
network with the global Markov property, then adding new edges to the graph results in a
Markov network which has the global Markov property, too.

If the graph associated with a Markov network is not complete, then it can be
transformed by adding edges into a triangulated graph, which is equivalent to a junction
tree structure, say of order k. Since the global Markov property holds for this graph,
the probability distribution can be written in product-division form, where the largest
marginal probability distribution contains k variables. A natural question is whether
the greedy algorithm finds the k-th order junction tree which gives the
true probability distribution. The answer is that under some assumptions
our greedy algorithm guarantees the optimal solution, which in this context is the
true probability distribution.

We need the following assertion:

Lemma 1 $H(X_1 | X_2, \ldots, X_k) = H(X_1) - [I(X_1, \ldots, X_k) - I(X_2, \ldots, X_k)]$.

Proof

$$H(X_1 | X_2, \ldots, X_k) = H(X_1, X_2, \ldots, X_k) - H(X_2, \ldots, X_k)$$

$$= \Big( H(X_1, X_2, \ldots, X_k) - \sum_{i=1}^{k} H(X_i) \Big) - \Big( H(X_2, \ldots, X_k) - \sum_{i=2}^{k} H(X_i) \Big) + H(X_1)$$

$$= H(X_1) - (I(X_1, X_2, \ldots, X_k) - I(X_2, \ldots, X_k)).$$

Remark 2 It is easy to see that maximizing $I(X_1, \ldots, X_k) - I(X_2, \ldots, X_k)$ is the same
as maximizing $H(X_1) - H(X_1 | X_2, \ldots, X_k)$.
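
Lemma 1 is an identity between entropies, so it can be checked numerically on any distribution. The Python sketch below does this for a made-up random distribution of four binary variables (k = 4, 0-based indices); the helper names are assumptions made here.

```python
import itertools
import math
import random


def entropy(joint, idx):
    """Entropy (bits) of the marginal on the 0-based indices idx."""
    marg = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)


def info_content(joint, idx):
    """I(X_idx) = sum of single-variable entropies minus joint entropy."""
    return sum(entropy(joint, (i,)) for i in idx) - entropy(joint, idx)


# Hypothetical random joint distribution of four binary variables.
random.seed(0)
states = list(itertools.product((0, 1), repeat=4))
raw = [random.random() for _ in states]
joint = {x: r / sum(raw) for x, r in zip(states, raw)}

everything, rest = (0, 1, 2, 3), (1, 2, 3)
# Left-hand side: H(X_1 | X_2, ..., X_k) as a difference of joint entropies.
conditional = entropy(joint, everything) - entropy(joint, rest)
# Right-hand side of Lemma 1.
lemma_rhs = entropy(joint, (0,)) - (
    info_content(joint, everything) - info_content(joint, rest)
)
```

The two quantities `conditional` and `lemma_rhs` agree up to floating-point error.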

We introduce the following notation.

Let $\mathcal{K} = \{K = \{i_1, \ldots, i_k\} \mid i_1, \ldots, i_k \in V\}$ be the set of all possible k-element
subsets of V.

Let $M : \mathcal{K} \to R$ be defined as $M_K = \max_{i \in K} \{ I(X_K) - I(X_{K \setminus \{i\}}) \}$ and let

$$K^* = \arg\max_{K \in \mathcal{K}} M_K. \qquad (3)$$

We prove the following two theorems. 

Theorem 3 If X has a k-th order t-cherry tree representation then K* is a cluster of
the junction tree.

Proof We argue by contradiction. Suppose $K^* = \{i_1, \ldots, i_k\} \notin \mathcal{C}$. Let
us consider the smallest subjunction tree which contains all the vertices $i_1, \ldots, i_k$ at
least once. In this subjunction tree one of the vertices $i_1, \ldots, i_k$ is a simplicial vertex
(a vertex which is contained in one cluster only). For simplicity let this vertex be $i_1$
and the cluster which contains it $\{i_1, s_1, \ldots, s_{k-1}\}$, with $\{s_1, \ldots, s_{k-1}\} \ne \{i_2, \ldots, i_k\}$.
We emphasize here that it is not necessary that $\{s_1, \ldots, s_{k-1}\} \cap \{i_2, \ldots, i_k\} = \emptyset$.

Since $i_1$ is a simplicial vertex, $X_{i_1}$ depends on all the other random variables of the
subjunction tree only through its neighbours $X_{s_1}, \ldots, X_{s_{k-1}}$, therefore

$$H(X_{i_1} | X_{s_1}, \ldots, X_{s_{k-1}}) < H(X_{i_1} | X_{i_2}, \ldots, X_{i_k}).$$

Using Lemma 1 this inequality is equivalent to:

$$H(X_{i_1}) - [I(X_{i_1}, X_{s_1}, \ldots, X_{s_{k-1}}) - I(X_{s_1}, \ldots, X_{s_{k-1}})]
< H(X_{i_1}) - [I(X_{i_1}, X_{i_2}, \ldots, X_{i_k}) - I(X_{i_2}, \ldots, X_{i_k})],$$

that is

$$I(X_{i_1}, X_{s_1}, \ldots, X_{s_{k-1}}) - I(X_{s_1}, \ldots, X_{s_{k-1}}) > I(X_{i_1}, X_{i_2}, \ldots, X_{i_k}) - I(X_{i_2}, \ldots, X_{i_k}),$$

which is in contradiction with the hypothesis that $\{i_1, \ldots, i_k\} = K^*$.

In the following we introduce the so-called puzzle algorithm, which results in a special
numbering of the vertices of a t-cherry junction tree.

Algorithm 4 Puzzle algorithm.

Input: a k-th order t-cherry junction tree H(V, F) (acyclic hypergraph with edges
of size k and separators of size k - 1).

Output: a numbering $\{i_1, \ldots, i_d\}$ of the vertices of $V = \{1, \ldots, d\}$.

Step 1. Initialization.

Let $e_1 \in F$, call it parent edge. The vertices belonging to the parent edge are
numbered in an arbitrary order by $i_1, \ldots, i_k$.

$s := k$, $\mathcal{S}_s := \{S_{i_1}, \ldots, S_{i_k}\}$, where for $j = 1, \ldots, k$, $S_{i_j}$ are all the $(k-1)$-
element subsets of $e_1$.

Step 2. Iteration.

Do $F := F \setminus \{e_j\}$, where $e_j$ is the edge numbered last ($e_1$ in the first iteration).

If $F \ne \emptyset$ then take $e_j \in F$ which contains one of the elements $S$ of $\mathcal{S}_s$.
Set $s := s + 1$,
assign $i_s$ to the vertex $i = e_j \setminus S$,

$\mathcal{S}_s := \mathcal{S}_{s-1} \cup \{S_{i_1}, \ldots, S_{i_k}\}$, where $S_{i_1}, \ldots, S_{i_k}$ are all the $(k-1)$-
element subsets of $e_j$. Go to Step 2.
else Stop.
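
The puzzle algorithm can be sketched in Python as follows. This is an illustrative reading, not the authors' code: the function name `puzzle_numbering`, the representation of clusters as sets, and the choice of the first matching edge when several fit are assumptions made here.

```python
from itertools import combinations


def puzzle_numbering(clusters, parent=0):
    """Number the vertices of a k-th order t-cherry junction tree.

    clusters -- list of k-element vertex sets (the hyperedges)
    parent   -- position of the parent edge in `clusters`
    """
    k = len(clusters[0])
    remaining = [frozenset(c) for c in clusters]
    first = remaining.pop(parent)
    # Step 1: number the parent edge and collect its (k - 1)-element subsets.
    numbering = sorted(first)
    separators = {frozenset(s) for s in combinations(first, k - 1)}
    # Step 2: attach edges that fit onto an available separator, one by one.
    while remaining:
        for e in remaining:
            new = e - set(numbering)
            if len(new) == 1 and (e - new) in separators:
                break
        else:
            raise ValueError("clusters do not form a connected t-cherry junction tree")
        remaining.remove(e)
        numbering.append(next(iter(new)))
        separators |= {frozenset(s) for s in combinations(e, k - 1)}
    return numbering


tree = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
from_first = puzzle_numbering(tree, parent=0)
from_last = puzzle_numbering(tree, parent=2)
```

Starting from the parent edge {1, 2, 3} the vertices are numbered 1, 2, 3, 4, 5; starting from {3, 4, 5} the numbering is 3, 4, 5, 2, 1. Different parent edges therefore lead to different puzzle numberings of the same tree.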
Definition 5 The numbering $\{i_1, \ldots, i_d\}$ of the vertices of $V = \{1, \ldots, d\}$ obtained
using Algorithm 4 is called puzzle numbering.

Theorem 4 If the following two assumptions are fulfilled, then the Szantai-Kovacs
algorithm finds the true probability distribution.

(i) The Markov network can be transformed into a k-th order t-cherry tree by adding
some edges if necessary.

(ii) Starting from the parent cluster defined by (3) there exists a puzzle numbering
with the following property: for all $i_r < i_s$ and for any $S \in \mathcal{S}_r$

$$H(X_{i_s}) - H(X_{i_s} | X_S) < H(X_{i_r}) - H(X_{i_r} | X_{S_{i_r}}),$$

where $S_{i_r}$ is the separator which separates $i_r$ from the tree containing the vertices
$\{i_1, \ldots, i_{r-1}\}$.

Proof We proved in Theorem 3 that the cluster K* which satisfies (3) is a cluster of
the junction tree associated with the Markov network. We choose this cluster as parent
edge.

Let us suppose that the Szantai-Kovacs algorithm already has m - 1 vertices in the
constructed junction tree. We denote this set of vertices by $V_{m-1}$. The set of possible
separators at this point is $\mathcal{S}_{m-1}$.

The Szantai-Kovacs algorithm adds a new cluster by maximizing

$$I(X_{i_m}, X_{S_i}) - I(X_{S_i}), \quad \text{where } i_m \in V \setminus V_{m-1} \text{ and } S_i \in \mathcal{S}_{m-1}.$$

According to Remark 2 this is equivalent to maximizing

$$H(X_{i_m}) - H(X_{i_m} | X_{S_i}), \quad \text{where } i_m \in V \setminus V_{m-1} \text{ and } S_i \in \mathcal{S}_{m-1}. \qquad (4)$$

We suppose now by contradiction that $i_m$ is not connected to the existing junction
tree through $S_i$. Since the junction tree is a connected hypergraph, there exist two
possibilities:

1. $i_m$ is separated from the existing tree $T_{m-1}$ by another separator $S_j \in \mathcal{S}_{m-1}$;

2. There exists $i_n \in V \setminus V_{m-1}$ which is connected with the existing junction tree by
$S_{i_n} \in \mathcal{S}_{m-1}$, and the cluster $(S_{i_n}, i_n)$ is on the path between the existing tree $T_{m-1}$
and the cluster which contains $i_m$.

Now we prove that neither of the two possibilities can occur.

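1. If $i_m$ is separated from the existing tree $T_{m-1}$ by another separator $S_j \in \mathcal{S}_{m-1}$,
then according to the global Markov property we have:

$$P(X_{T_{m-1}}, X_{i_m}) = \frac{P(X_{T_{m-1}})\, P(X_{S_j}, X_{i_m})}{P(X_{S_j})}.$$

This implies that the Kullback-Leibler divergence between

$$P(X_{T_{m-1}}, X_{i_m}) \quad \text{and} \quad \frac{P(X_{T_{m-1}})\, P(X_{S_j}, X_{i_m})}{P(X_{S_j})}$$

is 0:

$$KL = I(X_{T_{m-1}}, X_{i_m}) - (I(X_{T_{m-1}}) + I(X_{S_j}, X_{i_m}) - I(X_{S_j})) = 0.$$

Thus

$$I(X_{S_j}, X_{i_m}) - I(X_{S_j}) = I(X_{T_{m-1}}, X_{i_m}) - I(X_{T_{m-1}}). \qquad (5)$$

On the other hand, if $S_i$ does not separate $i_m$ from the existing tree then the KL
divergence between

$$P(X_{T_{m-1}}, X_{i_m}) \quad \text{and} \quad \frac{P(X_{T_{m-1}})\, P(X_{S_i}, X_{i_m})}{P(X_{S_i})}$$

is positive:

$$KL = I(X_{T_{m-1}}, X_{i_m}) - (I(X_{T_{m-1}}) + I(X_{S_i}, X_{i_m}) - I(X_{S_i})) > 0.$$

Thus

$$I(X_{S_i}, X_{i_m}) - I(X_{S_i}) < I(X_{T_{m-1}}, X_{i_m}) - I(X_{T_{m-1}}). \qquad (6)$$

From (5) and (6) we have $I(X_{S_i}, X_{i_m}) - I(X_{S_i}) < I(X_{S_j}, X_{i_m}) - I(X_{S_j})$.
According to Remark 2 this implies

$$H(X_{i_m}) - H(X_{i_m} | X_{S_i}) < H(X_{i_m}) - H(X_{i_m} | X_{S_j}),$$

which is in contradiction with the maximization of (4).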
2. If on the path between the existing tree $T_{m-1}$ and the cluster which contains
$i_m$ there exists a cluster $(S_{i_n}, i_n)$, where $i_n \in V \setminus V_{m-1}$ and $S_{i_n} \in \mathcal{S}_{m-1}$, then
according to the puzzle numbering $i_n < i_m$. Using (ii) we have:

$$H(X_{i_m}) - H(X_{i_m} | X_S) < H(X_{i_n}) - H(X_{i_n} | X_{S_{i_n}})$$

for any $S \in \mathcal{S}_{m-1}$, with $S_{i_n} \in \mathcal{S}_{m-1}$ the separator between $i_n$ and the existing tree
$T_{m-1}$. This is in contradiction with the maximization of (4).

Theorem 5 If the best approximating k-th order t-cherry probability distribution
has a puzzle numbering which, starting from the parent cluster defined by (3), satisfies
(i) and (ii), then the Szantai-Kovacs algorithm finds the best approximating k-th order
t-cherry probability distribution.

i) for all $i_r < i_s$, for any $S \in \mathcal{S}_r$, $H(X_{i_s}) - H(X_{i_s} | X_S) < H(X_{i_r}) - H(X_{i_r} | X_{S_{i_r}})$,
where $S_{i_r} \in \mathcal{S}_r$ is the separator which separates $i_r$ from the tree containing the
vertices $\{i_1, \ldots, i_{r-1}\}$;

ii) for all $i_r$, $r > k$,

$$X_{i_r} = \arg\min_{i \in V \setminus \{i_1, \ldots, i_{r-1}\}} KL(P_{app}(X_{i_1}, \ldots, X_{i_{r-1}}, X_i), P(X_{i_1}, \ldots, X_{i_{r-1}}, X_i)).$$

Proof Let the cluster K* which satisfies (3) be the first cluster of the junction tree. We
choose this cluster as parent edge.

Let us suppose that the Szantai-Kovacs algorithm already has m - 1 vertices in the
constructed junction tree. The set of possible separators at this point is $\mathcal{S}_{m-1}$.

The Szantai-Kovacs algorithm adds a new cluster by maximizing

$$I(X_{i_m}, X_{S_i}) - I(X_{S_i}), \quad \text{where } i_m \in V \setminus V_{m-1} \text{ and } S_i \in \mathcal{S}_{m-1}. \qquad (7)$$

According to Remark 2 this is equivalent to maximizing

$$H(X_{i_m}) - H(X_{i_m} | X_{S_i}), \quad \text{where } i_m \in V \setminus V_{m-1} \text{ and } S_i \in \mathcal{S}_{m-1}. \qquad (8)$$

We suppose now by contradiction that in the best approximating junction tree $i_m$
is not connected to the existing junction tree through $S_i$. Since the best approximating
junction tree is a connected hypergraph, there exist two possibilities:

1. $i_m$ is separated from the existing tree $T_{m-1}$ by another separator $S_j \in \mathcal{S}_{m-1}$;

2. In the best approximating junction tree there exists $i_n \in V \setminus V_{m-1}$ which is
connected with the existing junction tree by $S_{i_n} \in \mathcal{S}_{m-1}$, and the cluster $(S_{i_n}, i_n)$ is on
the path between the existing tree and the cluster which contains $i_m$.

Now we prove that neither of the two possibilities can occur.

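1. If $i_m$ is separated from the existing tree $T_{m-1}$ by another separator $S_j$, then
according to the Markov property we have:

$$P^{j}_{app}(X_{T_{m-1}}, X_{i_m}) = \frac{P(X_{T_{m-1}})\, P(X_{S_j}, X_{i_m})}{P(X_{S_j})}.$$

This implies that the Kullback-Leibler divergence between

$$P(X_{T_{m-1}}, X_{i_m}) \quad \text{and} \quad P^{j}_{app}(X_{T_{m-1}}, X_{i_m})$$

is given by:

$$KL(P^{j}_{app}(X_{T_{m-1}}, X_{i_m}), P(X_{T_{m-1}}, X_{i_m}))
= I(X_{T_{m-1}}, X_{i_m}) - (I(X_{T_{m-1}}) + I(X_{S_j}, X_{i_m}) - I(X_{S_j})).$$

According to (7)

$$I(X_{S_j}, X_{i_m}) - I(X_{S_j}) < I(X_{S_i}, X_{i_m}) - I(X_{S_i})$$

and this implies that

$$KL(P^{i}_{app}(X_{T_{m-1}}, X_{i_m}), P(X_{T_{m-1}}, X_{i_m}))
= I(X_{T_{m-1}}, X_{i_m}) - (I(X_{T_{m-1}}) + I(X_{S_i}, X_{i_m}) - I(X_{S_i})) < KL(P^{j}_{app}, P).$$

This is in contradiction with (ii).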

2. If on the path between the existing tree $T_{m-1}$ and the cluster which contains
$i_m$ there exists a cluster $(S_{i_n}, i_n)$, where $i_n \in V \setminus V_{m-1}$ and $S_{i_n} \in \mathcal{S}_{m-1}$, then
according to the puzzle numbering $i_n < i_m$ and (i) we have:

$$H(X_{i_m}) - H(X_{i_m} | X_S) < H(X_{i_n}) - H(X_{i_n} | X_{S_{i_n}})$$

for any $S \in \mathcal{S}_{m-1}$, with $S_{i_n} \in \mathcal{S}_{m-1}$ the separator between $i_n$ and the existing tree
$T_{m-1}$. This is in contradiction with the maximization of (8).
5 Conclusions

In this paper we give a greedy algorithm for fitting a k-width junction tree approximation
by minimizing the Kullback-Leibler divergence. The problem of finding the best
approximation of this kind is in general NP-hard. We reduce the search space
to the so-called k-th order t-cherry junction tree probability distributions. We then
compare our algorithm with Malvestuto's algorithm. We proved that our algorithm finds
in its first step a cluster which belongs to the junction tree. Malvestuto's algorithm
has no such guarantee. Besides this, our formula (1) for the Kullback-Leibler divergence
separates out a larger part which does not depend on the structure of the tree than
Malvestuto's formula (2).

We proved that under some assumptions our algorithm finds the optimal solution. 

By discovering the t-cherry junction tree probability distribution assigned to a
Markov network we can obtain much information on the dependence structure underlying
the random variables. This information can be used for storing the data in lower
dimensional contingency tables. The method can be applied in classification problems
where it is possible to select the "informative" variables which directly influence the
classification variable, see [19].